*https://github.com/YuliyaSkakun/IODS-project*
Analysis of the regression output
This week I started learning how to read and explore data in preparation for regression analysis, with the aim of finding the reasons behind specific patterns that can be observed in the data. There was also an emphasis on the graphical representation of the data, as it usually facilitates the analysis. All in all, the tools and techniques acquired this week will help me to work with data.
Tasks 1 and 2 in the analysis
Create a summary of the variables
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
In this plot I will analyse how students' attitude relates to their exam results, separately by gender.
## Warning: package 'ggplot2' was built under R version 3.3.2
The regression line was added to make the trend clearly visible and to compare the relation between attitude and points across genders. There is a clear positive correlation between attitude and points, which is logical: the more dedicated a person is to studying, the higher the grade, and this dedication shows in his or her attitude.
Also, I will analyse the relationships between the variables by creating a scatterplot.
In order to have a clearer picture of the interrelationships, I will create a more colourful graph using the ggplot function.
## Warning: package 'GGally' was built under R version 3.3.2
Analysing the plot, we can conclude that the majority of the people analysed are students aged around 20, while few people in the dataset are older than 30. Accordingly, the boxplot of the age variable shows a median just above 20 (22 in the summary above), and the higher values appear as outliers.
Also, one can clearly see the relationship between attitude and the points a student achieves: these are positively correlated. Points and age, as well as surf and points, are negatively correlated. However, the correlation coefficients are relatively low in all cases, so apart from attitude one cannot observe a clear link between points and the explanatory variables. For attitude the link is clear: the better the attitude, the more responsible the student is and the higher the grade he or she will score. Furthermore, I cannot clearly state that there is any relationship between points, attitude and the learning approach scores (deep, stra or surf). Nothing in the plot indicates that males perform better than females or vice versa.
Task 3 in the analysis Create a regression.
reg1 <- lm(points ~ attitude + gender+ age, data = learning2014)
##
## Call:
## lm(formula = points ~ attitude + gender + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## attitude 0.36066 0.05932 6.080 8.34e-09 ***
## genderM -0.33054 0.91934 -0.360 0.720
## age -0.07586 0.05367 -1.414 0.159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
The only variable that is statistically significant at the 1% level is attitude. If the attitude of a student increases by 1, points are expected to increase by about 0.36, holding gender and age constant. This result is in line with the p4 graph that was plotted and described previously. The explanatory variables gender and age are not statistically significant. The model explains about 20% of the variation in the dependent variable.
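As a rough illustration of how the reported estimates translate into predictions, the sketch below plugs the coefficients from the summary above into the fitted equation. The helper `predict_points` and the example student are hypothetical and not part of the original analysis.

```r
# Coefficients copied from the summary of reg1 above
b <- c(intercept = 13.42910, attitude = 0.36066, genderM = -0.33054, age = -0.07586)

# Hypothetical helper: predicted exam points for one student
predict_points <- function(attitude, male, age) {
  unname(b["intercept"] + b["attitude"] * attitude +
           b["genderM"] * as.numeric(male) + b["age"] * age)
}

# Example student: female, 22 years old, attitude score of 32
p1 <- predict_points(attitude = 32, male = FALSE, age = 22)
# The same student with an attitude score one unit higher
p2 <- predict_points(attitude = 33, male = FALSE, age = 22)

p2 - p1  # the difference equals the attitude coefficient
```

With the data at hand, `predict(reg1, newdata = ...)` would give the same kind of result without copying the coefficients by hand.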
Task 4 in the analysis
reg2 <- lm(points ~ attitude, data = learning2014)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
After the elimination of the insignificant variables, the t-statistic of attitude has slightly increased (from 6.08 to 6.21), while its estimate has slightly decreased (from 0.361 to 0.353). Also, because the removed variables had almost no impact on points (the dependent variable), the R-squared has decreased by only about one percentage point (from 0.202 to 0.191). This also indicates that the eliminated variables had practically no effect on points.
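To compare the two models on equal terms, one can look at the adjusted R-squared, which penalises additional predictors. The sketch below recomputes it from the R-squared values printed in the two summaries; the function name `adj_r2` is an illustrative choice, and n = 166 is taken from the gender counts in the variable summary.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 166  # 110 female + 56 male students

adj_full    <- adj_r2(0.2018, n, p = 3)  # attitude + gender + age
adj_reduced <- adj_r2(0.1906, n, p = 1)  # attitude only

round(c(full = adj_full, reduced = adj_reduced), 3)
```

The two adjusted values are nearly identical, which supports dropping gender and age.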
Task 5 in the analysis
As the analysed model includes only one explanatory variable, attitude, the intuition is that the better a person's attitude towards studies, the more time and effort he or she spends on education. This effect will definitely be reflected in the points the person gets for the exam or test. Therefore, I believe that a clear positive relation should be observed between these variables.
The residuals vs fitted values plot shows that, in general, the model explains part of the variation in the dependent variable. However, many observations lie far from the fitted values. Therefore, one can conclude that additional variables are needed in order to explain the dependent variable better (by reducing the sum of squared residuals).
The normality plot shows that the residuals largely follow the normal distribution. However, additional tests can be performed in order to confirm this statistically.
The residuals vs leverage plot shows that no single observation has an outsized influence on the estimated effect on the points received by students. This follows from the fact that the Cook's distance contours are barely visible in the graph.
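The influence measures behind the residuals vs leverage plot can also be inspected numerically. The following is a minimal sketch on simulated data that only mimics the shape of the course data; the variable values, the seed and the 4/n rule of thumb are assumptions made for illustration.

```r
set.seed(1)  # simulated stand-in for the course data
n <- 166
attitude <- round(runif(n, 14, 50))
points <- 11.6 + 0.35 * attitude + rnorm(n, sd = 5.3)
fit <- lm(points ~ attitude)

cd  <- cooks.distance(fit)  # influence of each observation on the fit
lev <- hatvalues(fit)       # leverage of each observation

# Observations above a common rule-of-thumb threshold of 4/n
flagged <- which(cd > 4 / n)
max(cd)
```

On the fitted model itself, `plot(fit, which = c(1, 2, 5))` draws the three diagnostic plots discussed above.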
The data on alcohol consumption contain 382 observations of students who are taking math and Portuguese courses. There are 35 variables (binary, nominal and numeric) that may have some explanatory power for the main question of the analysis: what factors affect high alcohol consumption among students? They include general information about the student, such as age, sex, the amount of time they study, their free time and whether the student is in a romantic relationship. There is also some information on the family, such as family size or whether the child receives family support. I believe that these are factors that may affect the behaviour of the student and, respectively, his or her attitude towards alcohol.
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Another variable of high interest to me is the amount of study time. I assume that alcohol consumption and the number of hours the student dedicates to studying are negatively correlated. This may be because the student simply does not have enough time to hang out with friends and drink too much. It may also reflect the fact that these students are more self-conscious: they may treat heavy drinking as something that negatively affects their productivity and health, and therefore tend to abstain from it.
Another variable of interest is the romantic relationship. I believe that people in a relationship are more self-conscious and therefore less inclined to drink.
Another factor that may affect high alcohol consumption is the number of absences. It seems plausible that these two are correlated: the more a person is absent from classes, the more he or she may be discouraged from studies and, respectively, the less time he or she will dedicate to studying.
Based on my intuition about the possible effects these variables can have on alcohol consumption, I will create a couple of plots in order to check my thoughts.
The following boxplot shows the distribution of studytime. Females who do not consume a lot of alcohol tend to study more, 2-3 hours in general; for males it is 1-2 hours. However, the median value is equal for both groups.
## Warning: package 'tidyr' was built under R version 3.3.2
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The following bar chart shows that those who spend 1-2 hours on studies tend to have higher alcohol consumption.
However, when one considers romantic relationships, one can see a clear pattern: being in a relationship lowers the number of those who have high alcohol consumption.
The tabulation confirms that males tend to have higher alcohol consumption than females.
## Source: local data frame [4 x 3]
## Groups: sex [?]
##
## sex high_use count
## <fctr> <lgl> <int>
## 1 F FALSE 156
## 2 F TRUE 42
## 3 M FALSE 112
## 4 M TRUE 72
The following bar chart shows, to my mind, quite a controversial result: the fewer classes are skipped, the more the person tends to drink alcohol. The only explanation I can offer is that a large number of classes causes stress that has to be relieved with alcohol.
In general, there are fewer students with high alcohol consumption than without.
## Observations: 382
## Variables: 1
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE...
## Observations: 382
## Variables: 2
## $ key <chr> "high_use", "high_use", "high_use", "high_use", "high_us...
## $ value <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
Logistic Regression
In order to be able to interpret and analyse the interrelationships, I will perform a regression analysis.
##
## Call:
## glm(formula = high_use ~ romantic + absences + sex + studytime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1456 -0.8356 -0.6145 1.0895 2.0865
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.87199 0.41840 -2.084 0.03715 *
## romanticyes -0.20370 0.26174 -0.778 0.43643
## absences 0.09026 0.02314 3.900 9.61e-05 ***
## sexM 0.78878 0.25124 3.140 0.00169 **
## studytime -0.39477 0.15916 -2.480 0.01312 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 422.66 on 377 degrees of freedom
## AIC: 432.66
##
## Number of Fisher Scoring iterations: 4
## (Intercept) romanticyes absences sexM studytime
## -0.87198682 -0.20369623 0.09025851 0.78877844 -0.39477449
The results suggest that one more unit of studytime lowers the log-odds of high alcohol consumption by 0.39, which corresponds to roughly 33% lower odds (exp(-0.39) ≈ 0.67). That is a sizeable reduction.
Each additional absence raises the odds of high alcohol consumption by about 9% (exp(0.09) ≈ 1.09).
However, in order to better understand the effects of the variables, I will compute the odds ratios.
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.4181200 0.1829448 0.9470310
## romanticyes 0.8157101 0.4840025 1.3541213
## absences 1.0944572 1.0481820 1.1479059
## sexM 2.2007065 1.3501218 3.6224923
## studytime 0.6738320 0.4886211 0.9135379
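The odds ratios in the table are the exponentiated log-odds coefficients. As a small check, the sketch below recomputes them from the coefficients printed earlier and expresses each effect as a percentage change in the odds; the object names are illustrative.

```r
# Log-odds coefficients copied from the model output above
log_odds <- c(romanticyes = -0.20369623, absences = 0.09025851,
              sexM = 0.78877844, studytime = -0.39477449)

odds_ratio <- exp(log_odds)           # multiplicative change in the odds
pct_change <- 100 * (odds_ratio - 1)  # the same effect as a percentage

round(cbind(odds_ratio, pct_change), 3)
```

An odds ratio below 1 (studytime, romantic) means lower odds of high use, and a ratio above 1 (absences, sex) means higher odds. Note that the confidence interval for romantic contains 1, in line with its insignificant coefficient.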
Another point that has to be analysed is the predictive power of the model.
Since my model contains an insignificant variable, romantic, I will eliminate it and afterwards compute the confusion matrix and the respective probabilities.
##
## Call:
## glm(formula = high_use ~ absences + sex + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1771 -0.8487 -0.5971 1.0986 2.1157
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.91782 0.41459 -2.214 0.02684 *
## absences 0.08876 0.02308 3.846 0.00012 ***
## sexM 0.79826 0.25065 3.185 0.00145 **
## studytime -0.40248 0.15930 -2.526 0.01152 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 423.27 on 378 degrees of freedom
## AIC: 431.27
##
## Number of Fisher Scoring iterations: 4
## prediction
## high_use FALSE TRUE
## FALSE 258 10
## TRUE 88 26
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67539267 0.02617801 0.70157068
## TRUE 0.23036649 0.06806283 0.29842932
## Sum 0.90575916 0.09424084 1.00000000
The confusion matrix shows 258 true negatives, 26 true positives, 88 false negatives and 10 false positives. Expressed as proportions, these are 67.5% true negatives, 6.8% true positives, 23.0% false negatives and 2.6% false positives.
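The usual summary measures follow directly from the four cells of the table. A minimal sketch, using the counts printed above (rows are the actual classes, columns the predictions); the object names are illustrative.

```r
# Counts copied from the cross-tabulation above
cm <- matrix(c(258, 88, 10, 26), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))

tn <- cm["FALSE", "FALSE"]; fp <- cm["FALSE", "TRUE"]
fn <- cm["TRUE", "FALSE"];  tp <- cm["TRUE", "TRUE"]

error_rate  <- (fp + fn) / sum(cm)  # share of wrong predictions
sensitivity <- tp / (tp + fn)       # share of actual high users detected
specificity <- tn / (tn + fp)       # share of actual low users detected

round(c(error_rate = error_rate, sensitivity = sensitivity,
        specificity = specificity), 3)
```

The low sensitivity shows that the model misses most of the actual high users, even though the overall error rate looks moderate.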
The result is also shown graphically.
The graph shows that, according to the odds ratio, there is approximately a 50% chance of a value being negative and a 50% chance of it being positive. Therefore the prediction cannot be considered reliable.
Additionally, I would like to perform a 10-fold cross-validation.
## [1] 0.2565445
## [1] 0.2617801
My model shows that on average about 26% of the predictions are wrong. That is the same value as in the DataCamp model.
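For reference, a 10-fold cross-validation of a logistic regression can be sketched with `cv.glm()` from the boot package. The data below are simulated and only stand in for the alc data; the coefficients used in the simulation, the seed and the name `loss_func` are assumptions made for illustration.

```r
library(boot)  # provides cv.glm()

set.seed(2)  # simulated stand-in for the alc data
n <- 382
d <- data.frame(absences = rpois(n, 4),
                studytime = sample(1:4, n, replace = TRUE))
d$high_use <- rbinom(n, 1, plogis(-0.9 + 0.09 * d$absences - 0.4 * d$studytime)) == 1

m <- glm(high_use ~ absences + studytime, data = d, family = "binomial")

# Loss: share of observations whose predicted class (cut-off 0.5) is wrong
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

training_error <- loss_func(d$high_use, predict(m, type = "response"))
cv <- cv.glm(data = d, glmfit = m, cost = loss_func, K = 10)
cv$delta[1]  # 10-fold cross-validated error
```

Because the folds are assigned at random, the cross-validated error changes slightly from run to run unless a seed is set.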
Therefore, I will fit a set of logistic regressions and compare their 10-fold cross-validation errors.
First of all, I will include most of the variables in the model.
##
## Call:
## glm(formula = high_use ~ famsup + absences + studytime + sex +
## age + address + famsize + higher + traveltime + freetime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1239 -0.7986 -0.5848 0.9768 2.1090
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.59569 2.16358 -2.124 0.033661 *
## famsupyes 0.06416 0.25597 0.251 0.802084
## absences 0.08768 0.02355 3.723 0.000197 ***
## studytime -0.36195 0.16548 -2.187 0.028724 *
## sexM 0.67323 0.26264 2.563 0.010368 *
## age 0.14514 0.10922 1.329 0.183893
## addressU -0.28805 0.30793 -0.935 0.349560
## famsizeLE3 0.25654 0.26828 0.956 0.338948
## higheryes -0.04456 0.54725 -0.081 0.935103
## traveltime 0.30840 0.17565 1.756 0.079135 .
## freetime 0.28943 0.12728 2.274 0.022971 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 408.83 on 371 degrees of freedom
## AIC: 430.83
##
## Number of Fisher Scoring iterations: 4
## [1] 0.2591623
## [1] 0.2696335
##
## Call:
## glm(formula = high_use ~ famsup + absences + studytime + sex +
## age + address + famsize + freetime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0577 -0.8005 -0.5855 0.9991 2.0911
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.14436 1.91942 -2.159 0.030837 *
## famsupyes 0.07655 0.25472 0.301 0.763790
## absences 0.08523 0.02329 3.659 0.000253 ***
## studytime -0.39277 0.16406 -2.394 0.016663 *
## sexM 0.68382 0.26043 2.626 0.008648 **
## age 0.15574 0.10605 1.469 0.141966
## addressU -0.45940 0.29001 -1.584 0.113172
## famsizeLE3 0.28751 0.26609 1.081 0.279914
## freetime 0.28023 0.12644 2.216 0.026664 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 411.91 on 373 degrees of freedom
## AIC: 429.91
##
## Number of Fisher Scoring iterations: 4
## [1] 0.2643979
## [1] 0.2670157
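The elimination of the insignificant variables higher and traveltime has slightly decreased the cross-validated prediction error (the second value, from 0.270 to 0.267), although the training error (the first value) has risen slightly. However, there are still some insignificant variables left.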
##
## Call:
## glm(formula = high_use ~ famsup + absences + studytime + sex +
## address + freetime, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1126 -0.8263 -0.5977 1.0191 2.1741
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.41943 0.63853 -2.223 0.026218 *
## famsupyes 0.01140 0.25078 0.045 0.963728
## absences 0.08971 0.02310 3.883 0.000103 ***
## studytime -0.38755 0.16170 -2.397 0.016540 *
## sexM 0.70197 0.25815 2.719 0.006544 **
## addressU -0.50961 0.28317 -1.800 0.071912 .
## freetime 0.27574 0.12591 2.190 0.028524 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 415.27 on 375 degrees of freedom
## AIC: 429.27
##
## Number of Fisher Scoring iterations: 4
## [1] 0.2617801
## [1] 0.2643979
With age and famsize also eliminated, both error values decrease slightly once more (to 0.262 and 0.264).
##
## Call:
## glm(formula = high_use ~ +absences + studytime + sex + freetime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9628 -0.8429 -0.5886 1.0710 2.1255
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.81622 0.59085 -3.074 0.00211 **
## absences 0.09063 0.02290 3.957 7.58e-05 ***
## studytime -0.38257 0.16018 -2.388 0.01693 *
## sexM 0.71513 0.25461 2.809 0.00497 **
## freetime 0.27181 0.12526 2.170 0.03001 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 418.45 on 377 degrees of freedom
## AIC: 428.45
##
## Number of Fisher Scoring iterations: 4
## [1] 0.2591623
## [1] 0.2617801
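Therefore, the elimination of the insignificant variables has not increased the probability of error: the cross-validated error of the final model (0.262) is slightly lower than that of the largest model (0.270) and equal to that of the earlier three-variable model. The differences are small and partly reflect the random assignment of folds in cross-validation.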
I have loaded the Boston dataset, which contains information on housing values in the suburbs of Boston. The data have 506 observations and 14 different variables, such as the crime rate of the town, the number of rooms per dwelling, the pupil-teacher ratio and the proportion of the lower-status population. All these variables help to evaluate the value of the houses in this area.
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506 14
The graphical representation of the variables shows that in some cases there is a strong correlation between variables; also, the observations tend to accumulate close to the edges of the plots.
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
However, I will also plot the correlation matrix in order to explore the data in more detail.
## Warning: package 'tidyverse' was built under R version 3.3.2
## Loading tidyverse: tibble
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
## select(): dplyr, MASS
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
The correlogram is giving a more comprehensive picture of the correlation between the variables. Therefore, one can clearly observe a negative correlation between indus/dis (proportion of non-retail business acres per town to weighted mean of distances to five Boston employment centres), nox/dis (nitrogen oxides concentration to weighted mean of distances to five Boston employment centres), age/dis (proportion of owner-occupied units built prior to 1940 to weighted mean of distances to five Boston employment centres) and lstat/medv (lower status of the population to median value of owner-occupied homes). Positive correlation is observed in indus/nox (proportion of non-retail business acres per town to nitrogen oxides concentration), rad/tax (index of accessibility to radial highways to full-value property-tax rate per $10,000), nox/age. Having a closer look at the correlations, these seem to be logical.
However, in order to be able to classify the observations, the variables have to be standardized so that they are comparable.
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
## [1] "matrix"
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
Standardization has centred every variable at a mean of zero and reduced its range: for example, crim now runs from about -0.42 to 9.92 instead of from 0.006 to 88.98. These standardized variables will be used in the further analysis.
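A minimal sketch of the standardization step, assuming the Boston data from the MASS package: `scale()` subtracts the column mean and divides by the column standard deviation, and returns a matrix, which is converted back to a data frame here. The object names are illustrative.

```r
library(MASS)  # provides the Boston data

boston_scaled <- as.data.frame(scale(Boston))  # (x - mean) / sd per column

col_means <- colMeans(boston_scaled)
col_sds   <- apply(boston_scaled, 2, sd)

round(range(boston_scaled$crim), 3)  # range of the scaled crime rate
```

After scaling, every column has mean 0 and standard deviation 1, so distances between observations are no longer dominated by variables measured in large units, such as tax.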
Also, I will create a categorical variable crime from the continuous variable crim, and then remove crim from the dataset so that it does not affect the further analysis.
## crime
## low med_low med_high high
## 127 126 126 127
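The recoding into four classes can be sketched as follows, assuming the quartiles of the scaled crime rate are used as break points. The class labels match the table above; the other object names are illustrative.

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))

# Quartiles of the scaled crime rate serve as break points
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# Replace the continuous variable with the categorical one
boston_scaled$crim <- NULL
boston_scaled$crime <- crime

table(crime)
```

Setting `include.lowest = TRUE` keeps the observation with the minimum crime rate in the first class instead of turning it into a missing value.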
After the necessary transformations, I will divide the data into a train set (80% of the data) and a test set (20% of the data) in order to proceed with the Linear Discriminant Analysis.
Now, I will fit the linear discriminant analysis on the train set. I will use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. LDA finds the combination of the explanatory variables that best separates the classes of the crime variable.
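A random 80/20 split can be sketched as below. For brevity the unscaled Boston data are used here, and the seed is an arbitrary choice that only makes the split reproducible.

```r
library(MASS)

set.seed(123)  # arbitrary seed so that the split is reproducible
n <- nrow(Boston)

# 506 * 0.8 = 404.8, which sample() truncates to 404 rows
ind <- sample(n, size = n * 0.8)

train <- Boston[ind, ]
test  <- Boston[-ind, ]

dim(train)
dim(test)
```

Every row ends up in exactly one of the two sets, so the test set can be used to judge how well the model predicts unseen data.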
## [1] "matrix"
## crime
## low med_low med_high high
## 127 126 126 127
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2648515 0.2376238 0.2524752 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 0.94393451 -0.8896511 -0.08835242 -0.8721307 0.442220077
## med_low -0.09196659 -0.3129709 0.05576262 -0.5929228 -0.091249630
## med_high -0.38551208 0.2126149 0.11366115 0.4134773 0.001106402
## high -0.48724019 1.0171737 -0.03371693 1.0807451 -0.404358387
## age dis rad tax ptratio
## low -0.8631064 0.8641915 -0.6909978 -0.7306067 -0.4020826
## med_low -0.3263492 0.3582813 -0.5523924 -0.5345488 -0.1252496
## med_high 0.4019457 -0.3903371 -0.4459199 -0.3230911 -0.2502641
## high 0.7990652 -0.8438398 1.6375616 1.5136504 0.7801170
## black lstat medv
## low 0.37618771 -0.76536144 0.52646572
## med_low 0.31303365 -0.13721126 0.01455462
## med_high 0.04002639 0.08902882 0.08160471
## high -0.83278486 0.81915889 -0.63376100
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.1147219986 0.630866665 -0.9434083
## indus 0.0263096758 -0.248793622 0.3789990
## chas -0.0178330869 0.002004403 0.1592167
## nox 0.3853405689 -0.664661522 -1.5031986
## rm -0.0003635623 -0.048155694 -0.1108547
## age 0.1987290099 -0.247982624 -0.0355141
## dis -0.0994479581 -0.118275818 0.1626831
## rad 3.2622286905 1.129429418 0.3320167
## tax 0.1243245393 -0.197558674 0.2127400
## ptratio 0.1754578349 0.040513806 -0.4403093
## black -0.1064666436 0.028259542 0.1134470
## lstat 0.2068943501 -0.277587123 0.3619926
## medv 0.1310537275 -0.364228288 -0.2501465
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9531 0.0351 0.0118
In order to see the full picture of the obtained results, I have plotted a graph in which different colours represent the different classes of the crime variable. The arrows indicate the impact of each predictor variable in the model. One can clearly see that rad (index of accessibility to radial highways) has the longest arrow and, respectively, the largest impact.
Now I will remove the crime variable from the test data and make predictions for this new dataset.
## crime
## low med_low med_high high
## 127 126 126 127
## predicted
## correct low med_low med_high high
## low 15 5 0 0
## med_low 2 10 9 0
## med_high 1 6 25 0
## high 0 0 0 29
The table shows that all predictions for the high crime rate are correct and most predictions for the low crime rate are correct (15 of 20). However, the predictions for the medium crime rates are not always correct.
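The whole prediction step can be sketched end to end as follows. The seed is arbitrary, so the resulting table will differ from the one above, which came from a different random split; the object names are illustrative.

```r
library(MASS)

# Rebuild the scaled data with the categorical crime variable
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(123)  # arbitrary seed; another split gives another table
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

lda.fit <- lda(crime ~ ., data = train)

# Keep the true classes aside and predict on the test set without them
correct_classes <- test$crime
test$crime <- NULL
lda.pred <- predict(lda.fit, newdata = test)

tab <- table(correct = correct_classes, predicted = lda.pred$class)
accuracy <- sum(diag(tab)) / sum(tab)  # share of correct predictions
tab
```

The diagonal of the table holds the correctly classified observations, so the accuracy is the diagonal sum divided by the size of the test set.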
Analysing the data from another angle, I will cluster the observations with the k-means method, which assigns clusters based on the distance between observations. The distance between observations is a measure of their similarity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.620 170.500 226.300 371.900 626.000
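The distance computation and the clustering can be sketched as below, assuming Euclidean distances on the Boston data and three clusters. The number of centres and the seed are assumptions made for illustration.

```r
library(MASS)

dist_eu <- dist(Boston)  # Euclidean distances between all pairs of rows
summary(dist_eu)

set.seed(123)  # k-means starts from randomly chosen centres
km <- kmeans(Boston, centers = 3)

table(km$cluster)  # number of observations in each cluster
```

Since k-means depends on the random starting centres, setting a seed (or using the `nstart` argument to try several starts) makes the cluster assignment more stable.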
The plot is showing the scaled pairs that are plotted against each other
## [1] 404 13
## [1] 13 3
## Warning: package 'plotly' was built under R version 3.3.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Also, here is another 3D plot where the colour is defined by the k-means clusters. It shows the same classes as in the LDA model, however, without classification by the crime rate.
As a bonus, I have decided to perform k-means on the original Boston data and then fit an LDA with the cluster assignment as the target class.
## crime
## low med_low med_high high
## 127 126 126 127
## Call:
## lda(km2$cluster ~ ., data = Boston)
##
## Prior probabilities of groups:
## 1 2 3
## 0.1996047 0.5237154 0.2766798
##
## Group means:
## crim zn indus chas nox rm age
## 1 0.7491682 10.49505 12.800396 0.05940594 0.5798416 6.189772 73.15743
## 2 0.2323824 17.69811 6.666981 0.07547170 0.4831913 6.468596 55.55623
## 3 12.0799686 0.00000 18.397286 0.06428571 0.6719000 6.004857 89.91143
## dis rad tax ptratio black lstat medv
## 1 3.394095 4.801980 403.5743 17.73465 369.2717 12.875941 22.20693
## 2 4.867279 4.316981 276.0377 17.84943 388.9088 9.440679 25.97019
## 3 2.054707 22.878571 661.8357 20.12286 286.5699 18.572857 16.26143
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.001231960 -0.0061488634
## zn 0.009459824 -0.0014772000
## indus 0.028954393 -0.0204754827
## chas 0.083882416 -0.1920615150
## nox 1.688279455 3.5675807263
## rm -0.052904794 -0.0740868296
## age -0.003403119 -0.0009436915
## dis -0.133340151 -0.0756860539
## rad 0.128615007 -0.3498011613
## tax 0.021536890 0.0164145869
## ptratio 0.055921351 -0.0639644060
## black -0.002807248 0.0001461840
## lstat -0.001948397 -0.0002527576
## medv 0.009225193 0.0149027194
##
## Proportion of trace:
## LD1 LD2
## 0.971 0.029
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped
Variable nox (nitrogen oxides concentration) seems to be the most influential.